Machine Learning (ML)
Feeding data to a computer algorithm so that it learns patterns and can make predictions in new, unseen situations.
ML Model
A computer object implementing an ML algorithm, trained on a set of data to perform a given task.
The four main categories of ML are based on the type of dataset used to train the model:
| | Supervised | Unsupervised | Semi-supervised | Reinforcement |
|---|---|---|---|---|
| Input | Data | Data | Data | Environment |
| Ground-truth | Yes | No | Partial | No (reward) |
| Examples | Classification, Regression | Clustering | Anomaly detection | Game playing |
Another way to categorize ML models is based on the type of output they produce:
| Category | Description | Example Outputs | Example Use Cases |
|---|---|---|---|
| Classification | Assign one (or multiple) label(s) chosen from a given list of classes to each element of the input. | “Cat”, “Dog”, “Bird” | Spam detection, Image recognition |
| Regression | Assign one (or multiple) value(s) chosen from a continuous set of values. | 3.5, 7.2, 15.8 | Stock price prediction, Age estimation |
| Clustering | Create categories by grouping together similar inputs. | Cluster 1, Cluster 2 | Customer segmentation, Image compression |
| Anomaly Detection | Detect outliers in the dataset. | Normal, Outlier | Fraud detection, Fault detection |
| Generative Models | Generate new data similar to the training data. | Image, Text, Audio | Image generation, Text completion |
| Ranking | Arrange items in order of relevance or importance. | Rank 1, Rank 2, Rank 3 | Search engine, Recommendation system |
| Reinforcement Learning | Learn a policy to maximize long-term rewards through interaction with an environment. | Policy, Action sequence | Game playing, Robotics control |
| Dimensionality Reduction | Reduce the number of features while retaining meaningful information. | 2D or 3D projection | Visualization, Data compression |
Dataset
A collection of data used to train, validate and test ML models.
Dataset example
| | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 0 | 4.8 | 3.1 | 1.6 | 0.2 | 0 |
| 1 | 6.5 | 2.8 | 4.6 | 1.5 | 1 |
| 2 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 147 | 7.7 | 3.0 | 6.1 | 2.3 | 2 |
| 148 | 7.7 | 3.8 | 6.7 | 2.2 | 2 |
| 149 | 4.6 | 3.2 | 1.4 | 0.2 | 0 |
150 rows × 5 columns
Iris dataset example
Instance (or sample)
An instance is one individual entry of the dataset (a row).
Feature (or attribute or variable)
A feature is a piece of information that the model uses to make predictions.
Label (or target or output or class)
A label is a piece of information that the model is trying to predict.
Feature vs. Label
Features and labels are simply different columns in the dataset with different roles.
Instances, features and labels
| | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Label |
|---|---|---|---|---|---|
| Instance 0 | 4.8 | 3.1 | 1.6 | 0.2 | 0 |
| Instance 1 | 6.5 | 2.8 | 4.6 | 1.5 | 1 |
| Instance 2 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| ... | ... | ... | ... | ... | ... |
| Instance 147 | 7.7 | 3.0 | 6.1 | 2.3 | 2 |
| Instance 148 | 7.7 | 3.8 | 6.7 | 2.2 | 2 |
| Instance 149 | 4.6 | 3.2 | 1.4 | 0.2 | 0 |
150 rows × 5 columns
Instance, feature and label in the Iris dataset
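In scikit-learn (assuming it is available), the features and labels of the table above are conventionally separated into `X` and `y`:

```python
from sklearn.datasets import load_iris

# Load the classic Iris dataset (150 instances, 4 features, 3 classes)
iris = load_iris()
X = iris.data    # feature matrix: one row per instance, one column per feature
y = iris.target  # label vector: the species (0, 1 or 2) to predict

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```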
Dataset subsets
An ML dataset is usually subdivided into three disjoint subsets, each with a distinct role in the training process: the training set (used to fit the weights), the validation set (used to tune hyperparameters and monitor overfitting) and the test set (used for the final performance evaluation).
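A common way to obtain such subsets, sketched with scikit-learn's `train_test_split` (the 60/20/20 proportions are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off 40% of the data, then halve it into validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```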
A tree-like structure used for both classification and regression
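A minimal decision-tree sketch with scikit-learn, using the Iris data as a stand-in task:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Limit the depth to keep the tree simple and human-readable
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.score(X, y))  # training accuracy (optimistic; evaluate on a held-out set in practice)
```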
An ensemble method that combines multiple decision trees:
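A random-forest sketch with scikit-learn (the number of trees is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# An ensemble of 100 decision trees, each trained on a random subset of data and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:3]))  # predictions for the first three instances
```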
Used for classification and regression, effective in high-dimensional spaces:
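Assuming this refers to Support Vector Machines (SVM), a minimal sketch with scikit-learn's `SVC`:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The RBF kernel lets the SVM learn nonlinear decision boundaries
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X, y)

print(svm.score(X, y))  # training accuracy
```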
Other methods for supervised learning include:
| Method | Description |
|---|---|
| Linear Regression | Predicts a continuous value with a linear model |
| Logistic Regression | Predicts a binary value with a linear model |
| K-Nearest Neighbors (KNN) | Non-parametric method for classification and regression |
| Boosting | Ensemble method (like Random Forests) that combines weak learners to form a strong model |
| Naive Bayes | Probabilistic classifier based on Bayes’ theorem |
A method for partitioning data into \(k\) clusters:
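A k-means sketch on synthetic data (the two blobs are a hypothetical example):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs of points, centered around (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.round(1))  # close to the true blob centers (order may vary)
```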
Builds a hierarchy of clusters using either agglomerative or divisive methods:
Clustering based on the density of data points:
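A density-based clustering sketch using DBSCAN on hypothetical data; unlike k-means, it does not need the number of clusters in advance and flags outliers as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a single far-away outlier
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2)), [[20.0, 20.0]]])

# eps: neighborhood radius; min_samples: points required to form a dense region
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(sorted(set(labels.tolist())))  # two clusters, plus -1 for the noise point
```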
Other methods for unsupervised learning include:
| Method | Description |
|---|---|
| Gaussian Mixture Models (GMM) | Probabilistic clustering assuming data is generated from multiple Gaussian distributions |
| Autoencoders | Neural networks that learn efficient representations of data in an unsupervised manner |
| Self-Organizing Maps (SOM) | Neural network-based method for clustering and visualization |
PCA (Principal Component Analysis) is a dimensionality reduction technique to project data into lower dimensions:
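A PCA sketch projecting the 4 Iris features onto 2 dimensions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)    # keep the two directions of greatest variance
X_2d = pca.fit_transform(X)  # project the 4 features down to 2

print(X_2d.shape)                             # (150, 2)
print(pca.explained_variance_ratio_.sum())    # fraction of the original variance retained
```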
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique primarily used for visualization of high-dimensional data
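A t-SNE sketch on the same data; note that t-SNE is stochastic, so the embedding varies between runs:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)

# Embed the 4-dimensional data in 2D for visualization
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_embedded.shape)  # (150, 2)
```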
Neural Network (NN)
A subtype of ML model inspired by the brain, composed of several interconnected layers of nodes that process and pass on information.
Deep Learning (DL)
A subcategory of Machine Learning that consists of using large NN models (i.e. with a high number of layers) to solve complex problems.
Structure of a NN
The most basic layer, in which each output is a linear combination of all the inputs (before the activation layer).
Fully connected layer
The output of a dense layer is computed as follows:
\[ \underbrace{ \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix} }_{\text{Input}} \cdot \underbrace{ \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix} }_{\text{Weights}} = \underbrace{ \begin{bmatrix} y_1 & y_2 & y_3 \end{bmatrix} }_{\text{Output}} \]
with \(y_1 = x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41}\)
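The matrix product above, written out with NumPy (the values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])            # input vector (4 features)
W = np.arange(12, dtype=float).reshape(4, 3)  # weight matrix (4 inputs x 3 outputs)

y = x @ W  # each output y_j is a linear combination of all inputs

# Check y_1 against the formula y_1 = x_1*w_11 + x_2*w_21 + x_3*w_31 + x_4*w_41
print(y[0] == x[0]*W[0, 0] + x[1]*W[1, 0] + x[2]*W[2, 0] + x[3]*W[3, 0])  # True
```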
A layer combining spatially close features, widely used to process rasters such as images.
The output of a convolutional layer is computed as follows:
\[ \underbrace{ \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix} }_{\text{Input}} \ast \underbrace{ \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} }_{\text{Kernel/Weights}} = \underbrace{ \begin{bmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \end{bmatrix} }_{\text{Output}} \]
with \(y_{11} = x_{11} w_{11} + x_{12} w_{12} + x_{21} w_{21} + x_{22} w_{22}\)
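The same computation in NumPy, sliding the 2x2 kernel over the 3x3 input (a direct sketch, not an optimized convolution):

```python
import numpy as np

x = np.arange(1, 10, dtype=float).reshape(3, 3)  # 3x3 input
w = np.array([[1.0, 0.0], [0.0, 1.0]])           # 2x2 kernel

# Valid sliding window: output is (3-2+1) x (3-2+1) = 2x2
y = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        y[i, j] = np.sum(x[i:i+2, j:j+2] * w)  # elementwise product, then sum

print(y)  # top-left output: 1*1 + 5*1 = 6
```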
A type of layer designed to process sequential data such as text, time series, speech or audio. It works by combining the input data with the state from the previous time step.
The two main variants of recurrent layers are LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
Nowadays, however, transformer architectures are preferred for processing sequential data.
A type of layer used to reduce the number of features by merging multiple features into one. There are multiple kinds of pooling layers, the simplest being Maximum (Max) Pooling and Average Pooling.
Max Pooling Example[10]
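A 2x2 max-pooling sketch in NumPy on an arbitrary 4x4 input:

```python
import numpy as np

x = np.array([[1.0, 3.0, 2.0, 4.0],
              [5.0, 6.0, 1.0, 2.0],
              [7.0, 2.0, 9.0, 1.0],
              [3.0, 4.0, 2.0, 8.0]])

# 2x2 max pooling with stride 2: keep the maximum of each 2x2 block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # maxima of the four 2x2 blocks: 6, 4, 7, 9
```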
| Type | Description |
|---|---|
| Residual | Skips some layers to improve training |
| Attention | Focuses on specific parts of the input data |
| Embedding | Transforms discrete input data into continuous vectors in a lower-dimensional space |
| Dropout | Randomly drops some of the nodes during training to reduce overfitting |
| Batch Normalization | Normalizes the input of each layer across the batch to improve training stability and speed |
| Layer Normalization | Normalizes the input of each layer across the features to improve training stability and speed |
| Flatten | Converts multi-dimensional data into 1D data that can be fed into fully connected layers |
A NN is defined by both its architecture and its weights
Two architectures with the same weights but doing very different things:
Both the architecture and the weights are needed
When importing a model, you need to import both the architecture and the weights. The process varies depending on the library used.
This also means that having access to the architecture alone is not enough to reproduce the model.
What do we want?
We want a model that performs well on a given task. To achieve this, we need to:
What are the right weights?
The right weights are the ones that allow the model to perform well on the task. More precisely:
Loss function
Function that evaluates the performance of a model. In supervised learning, it compares the predictions of the model to the ground-truth.
General idea
Since there is a huge number of possible combinations of weights, we need to search for the right ones. This process, called training, is iterative and repeats three steps: making predictions on the training data, evaluating the loss, and adjusting the weights to reduce it.
The process is repeated until the model performs well on a validation set.
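The training loop can be sketched with plain gradient descent on a toy linear model (a hypothetical setup; real training relies on a library's automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0          # ground truth: y = 3x + 1

w, b, lr = 0.0, 0.0, 0.1         # initial weights and learning rate
for _ in range(200):
    pred = w * X[:, 0] + b                      # 1. make predictions
    loss = np.mean((pred - y) ** 2)             # 2. evaluate the loss (MSE)
    grad_w = 2 * np.mean((pred - y) * X[:, 0])  # 3. compute gradients...
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w                            # ...and adjust the weights
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # close to the true values 3.0 and 1.0
```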
Multiple sources of issues and steps to perform:
No prior knowledge
A priori, all features are equally important, so none of them should have an advantage. Features whose values are larger than the others' would therefore be detrimental.
Usually, each feature is individually normalized over the whole dataset, to obtain a distribution with a mean of 0 and a standard deviation of 1:
\[ \begin{align*} \hat{X} & = \frac{1}{n} \sum\limits_{j=1}^n X_j \\ \sigma_X & = \sqrt{\frac{1}{n} \sum\limits_{j=1}^n (X_j - \hat{X})^2} \\ \forall k \in [1, \cdots, n],\ X_k & \leftarrow \frac{X_k - \hat{X}}{\sigma_X} \end{align*} \]
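This standardization, sketched with NumPy (scikit-learn's `StandardScaler` performs the same operation per feature):

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])  # two features with very different scales

# Standardize each feature (column) independently: mean 0, standard deviation 1
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))  # [0. 0.]
print(X_norm.std(axis=0))   # [1. 1.]
```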
Most common evaluation criteria for classification tasks:
| Name | Use case | Formula |
|---|---|---|
| Accuracy | Balanced datasets | \(\frac{TP + TN}{TP + TN + FP + FN}\) |
| Precision | False positives are costly | \(\frac{TP}{TP + FP}\) |
| Recall | False negatives are costly | \(\frac{TP}{TP + FN}\) |
| F1-Score | Unbalanced class distribution | \(\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\) |
| … | … | … |
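These criteria, computed with scikit-learn on a hypothetical set of binary predictions (here TP=2, TN=4, FP=1, FN=1):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]  # hypothetical model predictions

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```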
Most common evaluation criteria for regression tasks:
| Name | Use case | Formula |
|---|---|---|
| Mean Absolute Error (MAE) | Robust to outliers | \(\frac{1}{n} \sum\limits_{i=1}^n |y_i - \hat{y}_i|\) |
| Mean Square Error (MSE) | Sensitive to large errors | \(\frac{1}{n} \sum\limits_{i=1}^n (y_i - \hat{y}_i)^2\) |
| … | … | … |
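The regression criteria, computed with scikit-learn on hypothetical values:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]  # hypothetical ground truth
y_pred = [2.5, 5.0, 4.0, 8.0]  # hypothetical predictions

print(mean_absolute_error(y_true, y_pred))  # 0.75
print(mean_squared_error(y_true, y_pred))   # 0.875
```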
Cross-validation
Method to estimate the real performance of the model:
Cross-validation[11]
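A 5-fold cross-validation sketch with scikit-learn; the model is trained and evaluated 5 times, each time holding out a different fifth of the data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One accuracy score per fold; the mean estimates the real performance
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean())
```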
Once the data has been preprocessed, the model selected, and the hyperparameters chosen and optimized, the final model can be trained (potentially multiple times, keeping the best one).
Quality of the data is obviously crucial to train well-performing models. Quality encompasses multiple aspects:
Diversity is the most important aspect of a dataset, because ML models are good at generalizing within the situations they have seen but bad at guessing in entirely new ones. There are different aspects of diversity to keep in mind:
Biased
Refers to a model that systematically makes the same kind of wrong predictions in similar cases.
In practice, a model trained on biased data will most of the time reproduce those biases. This can have major consequences and shouldn’t be underestimated: even a cold-hearted ML algorithm is not objective if it wasn’t trained on objectively chosen and annotated data.
There exist model architectures, training methods and evaluation methods that try to prevent and detect biases. They can sometimes make it possible to build unbiased models from biased data, but this adds complexity to the training process and doesn’t always work.
Underfitting
When a model is too simple to properly extract information from a complex task. It can also be caused by key information missing from the input features.
Overfitting
When a model is too complex to properly generalize to new data. It often happens when a NN is trained for too long on a dataset that is not diverse enough, so that it learns the noise in the data.
| Lever | In case of underfitting | In case of overfitting |
|---|---|---|
| Complexity | Increase | Reduce |
| Number of features | Increase | Reduce |
| Regularization | Reduce | Increase |
| Training time | Increase | Reduce |
General strategies:
Interpretable
Qualifies an ML model whose decision-making process is straightforward and transparent, making it directly understandable by humans. This requires restricting the model’s complexity.
Explainable
Qualifies an ML model whose decision-making process can be partly interpreted after the fact using post hoc interpretation techniques. These techniques are often applied to models that are too complex to be interpretable.
Introduction to Machine Learning